Maximum entropy model parameterization with TF∗IDF weighted vector space model
نویسندگان
چکیده
Maximum entropy (MaxEnt) models have been used in many spoken language tasks. The training of a MaxEnt model often involves an iterative procedure that starts from an initial parameterization and gradually updates it towards the optimum. Due to the convexity of its objective function (hence a global optimum on a training set), little attention has been paid to model initialization in MaxEnt training. However, MaxEnt model training often ends early before convergence to the global optimum, and prior distributions with hyper-parameters are often added to the objective function to prevent over-fitting. This paper shows that the initialization and regularization hyper-parameter setting may significantly affect the test set accuracy. It investigates the MaxEnt initialization/regularization based on an n-gram classifier and a TF*IDF weighted vector space model. The theoretically motivated TF*IDF initialization/regularization has achieved significant improvements over the baseline flat initialization/regularization, especially when training data are sparse. In contrast, the n-gram based initialization/ regularization does not exhibit significant improvements. Index Terms — Maximum entropy model, TF*IDF, vector space model, n-gram classification model, model initialization, model regularization.
منابع مشابه
Intrusion Detection Using a Hybrid Support Vector Machine Based on Entropy and Tf-idf
The main functions of an Intrusion Detection System (IDS) are to protect computer networks by analyzing and predicting the actions of processes. Though IDS has been developed for many years, the large number of alerts makes the system inefficient. In this paper, we proposed a classification method based on Support Vector Machines (SVM) with a weighted voting schema to detect intrusions. First, ...
متن کاملInductive and example-based learning for text classification
Text classification has been widely applied to many practical tasks. Inductive models trained from labeled data are the most commonly used technique. The basic assumption underlying an inductive model is that the training data are drawn from the same distribution as the test data. However, labeling such a training set is often expensive for practical applications. On the other hand, a large amo...
متن کاملDesign of Thesis Topic Search Engine with Information Retrieval and Vector Space Model of TF-IDF Weighting
The development of internet makes improvement in relevant information needs. A way to get relevant information in internet is by using search engine application. Search engine application is a form of information retrieval system. Thesis searching is a problem that students face in their final study. A way to help them solve their problem is by using search engine, especially search engine that...
متن کاملText Document Clustering based on Phrase
Affinity propagation (AP) was recently introduced as an unsupervised learning algorithm for exemplar based clustering. In this paper novel text document clustering algorithm has been developed based on vector space model, phrases and affinity propagation clustering algorithm. Proposed algorithm can be called Phrase affinity clustering (PAC). PAC first finds the phrase by ukkonen suffix tree con...
متن کاملXML Document Classification using SVM
This paper describes a representation for XML documents in order to classify them. Document classification is based on document representation techniques. How relevant the representation phase is, the more relevant the classification will be. We propose a representation model that exploits both the structure and the content of document. Our approach is based on vector space model: a document is...
متن کامل